First, we read in the data:

data <- read.csv(file = "/Users/andreolli/Desktop/BodyFat.csv")
data <- data[, -c(1,3)]

The BODYFAT value of 0 in row 182 is clearly an error, so we drop row 182 from the subsequent analysis.
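As a quick sanity check, impossible body-fat values can be located programmatically; a minimal base-R sketch on toy data (the values here are synthetic, with one entry mimicking the bad record):

```r
# Toy illustration: locate physiologically impossible body-fat records
bodyfat <- c(12.6, 6.9, 0, 24.6)   # synthetic values; entry 3 mimics row 182
which(bodyfat <= 0)                # -> 3, the index of the suspect record
```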

The str function shows the data type of each column of data:

str(data)
## 'data.frame':    252 obs. of  15 variables:
##  $ BODYFAT  : num  12.6 6.9 24.6 10.9 27.8 20.6 19 12.8 5.1 12 ...
##  $ AGE      : int  23 22 22 26 24 24 26 25 25 23 ...
##  $ WEIGHT   : num  154 173 154 185 184 ...
##  $ HEIGHT   : num  67.8 72.2 66.2 72.2 71.2 ...
##  $ ADIPOSITY: num  23.7 23.4 24.7 24.9 25.6 26.5 26.2 23.6 24.6 25.8 ...
##  $ NECK     : num  36.2 38.5 34 37.4 34.4 39 36.4 37.8 38.1 42.1 ...
##  $ CHEST    : num  93.1 93.6 95.8 101.8 97.3 ...
##  $ ABDOMEN  : num  85.2 83 87.9 86.4 100 94.4 90.7 88.5 82.5 88.6 ...
##  $ HIP      : num  94.5 98.7 99.2 101.2 101.9 ...
##  $ THIGH    : num  59 58.7 59.6 60.1 63.2 66 58.4 60 62.9 63.1 ...
##  $ KNEE     : num  37.3 37.3 38.9 37.3 42.2 42 38.3 39.4 38.3 41.7 ...
##  $ ANKLE    : num  21.9 23.4 24 22.8 24 25.6 22.9 23.2 23.8 25 ...
##  $ BICEPS   : num  32 30.5 28.8 32.4 32.2 35.7 31.9 30.5 35.9 35.6 ...
##  $ FOREARM  : num  27.4 28.9 25.2 29.4 27.7 30.6 27.8 29 31.1 30 ...
##  $ WRIST    : num  17.1 18.2 16.6 18.2 17.7 18.8 17.7 18.8 18.2 19.2 ...

Every variable is floating point except AGE, which is an integer. The first question is whether AGE should enter the model as a factor. We start by plotting AGE against BODYFAT:

plot(y = data$BODYFAT, x = data$AGE)
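The choice matters because a factor AGE would spend one degree of freedom per distinct age, while a numeric AGE costs only a single slope. A toy sketch on synthetic data:

```r
# A factor predictor costs one coefficient per level beyond the baseline
d <- data.frame(age = rep(20:24, each = 10), y = rnorm(50))
length(coef(lm(y ~ age, data = d)))          # 2: intercept + one slope
length(coef(lm(y ~ factor(age), data = d)))  # 5: intercept + 4 level effects
```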

The plot suggests an approximately linear relationship between AGE and BODYFAT, which indicates that AGE need not be treated as a factor. We now verify this claim more precisely:

model.age <- lm(BODYFAT ~ AGE, data = data[-182,])
summary(model.age)
## 
## Call:
## lm(formula = BODYFAT ~ AGE, data = data[-182, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.3744  -5.7732   0.2424   4.8411  25.0155 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.12734    1.71990   6.470 5.17e-10 ***
## AGE          0.17563    0.03688   4.763 3.25e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.361 on 249 degrees of freedom
## Multiple R-squared:  0.08349,    Adjusted R-squared:  0.07981 
## F-statistic: 22.68 on 1 and 249 DF,  p-value: 3.248e-06
plot(model.age)

The outliers flagged in the diagnostic plots are observations 216, 192, and 36.

hat.plot <- function(fit){
  p <- length(coefficients(fit))
  n <- length(fitted(fit))
  plot(hatvalues(fit),main = "Index Plot of Hat Values")
  abline(h=c(2,3)*p/n,col="red",lty=2)
  identify(1:n, hatvalues(fit), names(hatvalues(fit)))  
  # interactive: click points on the plot; their names are returned once the device is closed
}
hat.plot(model.age)

## integer(0)
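For reference, the values plotted by hat.plot are the diagonal of the hat matrix H = X (X'X)^{-1} X', and the red lines mark the common 2p/n and 3p/n cutoffs. A base-R sketch on toy data, where the last x value sits far from the rest:

```r
# Hat values by hand: diagonal of H = X (X'X)^{-1} X'
x <- c(1, 2, 3, 4, 20)               # the last point is isolated: high leverage
X <- cbind(1, x)
H <- X %*% solve(t(X) %*% X) %*% t(X)
h <- diag(H)
p <- ncol(X); n <- length(x)
h[5] > 2 * p / n                     # TRUE: flagged by the 2p/n rule
```

These match `hatvalues()` exactly, since leverage depends only on the design matrix, not on the response.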

The high-leverage points are 79 and 252. We remove the flagged observations and refit the model:

data.age <- data[-c(79, 252, 36, 216, 182, 192),]
model.age.2 <- lm(BODYFAT ~ AGE, data = data.age)
summary(model.age.2)
## 
## Call:
## lm(formula = BODYFAT ~ AGE, data = data.age)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.1745  -5.5530   0.3672   4.9233  17.6255 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 11.14374    1.68244   6.624 2.22e-10 ***
## AGE          0.16945    0.03634   4.662 5.15e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.048 on 244 degrees of freedom
## Multiple R-squared:  0.0818, Adjusted R-squared:  0.07804 
## F-statistic: 21.74 on 1 and 244 DF,  p-value: 5.15e-06
plot(model.age.2)

AGE remains significant, the residual plot shows no pattern, the QQ plot indicates the normality assumption approximately holds, and there are no obvious outliers. So AGE can indeed enter the model directly as a numeric variable.

Next, we fit a linear regression of BODYFAT on all the other predictors and examine the result:

model <- lm(BODYFAT ~ ., data = data[-182, ])
summary(model)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data[-182, ])
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -10.3096  -2.6290  -0.1517   2.8910   9.2215 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -12.228338  16.220362  -0.754  0.45167    
## AGE           0.055629   0.029984   1.855  0.06480 .  
## WEIGHT       -0.075461   0.049969  -1.510  0.13234    
## HEIGHT       -0.057496   0.103293  -0.557  0.57831    
## ADIPOSITY     0.079041   0.277737   0.285  0.77621    
## NECK         -0.441507   0.217997  -2.025  0.04396 *  
## CHEST        -0.049292   0.098514  -0.500  0.61729    
## ABDOMEN       0.879418   0.085291  10.311  < 2e-16 ***
## HIP          -0.208197   0.136861  -1.521  0.12954    
## THIGH         0.206584   0.136134   1.518  0.13048    
## KNEE         -0.001488   0.229368  -0.006  0.99483    
## ANKLE         0.143953   0.207393   0.694  0.48830    
## BICEPS        0.160964   0.159976   1.006  0.31536    
## FOREARM       0.414567   0.184830   2.243  0.02583 *  
## WRIST        -1.494990   0.495848  -3.015  0.00285 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.988 on 236 degrees of freedom
## Multiple R-squared:  0.745,  Adjusted R-squared:  0.7298 
## F-statistic: 49.24 on 14 and 236 DF,  p-value: < 2.2e-16
plot(model)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

We first check the model for multicollinearity:

require(car)
## Loading required package: car
vif(model)
##       AGE    WEIGHT    HEIGHT ADIPOSITY      NECK     CHEST   ABDOMEN 
##  2.251601 33.455026  2.255654 15.937139  4.378607 10.600206 13.100684 
##       HIP     THIGH      KNEE     ANKLE    BICEPS   FOREARM     WRIST 
## 14.906809  7.885730  4.742794  1.926817  3.652422  2.165475  3.334965
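The VIF of predictor j is 1 / (1 - R²_j), where R²_j comes from regressing predictor j on all the other predictors. A base-R sketch on synthetic collinear data (variable names are illustrative):

```r
# VIF by hand: regress each predictor on the others, then 1 / (1 - R^2)
set.seed(1)
n <- 100
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly collinear with x1
x3 <- rnorm(n)
X <- data.frame(x1, x2, x3)
vifs <- sapply(names(X), function(v) {
  r2 <- summary(lm(reformulate(setdiff(names(X), v), v), data = X))$r.squared
  1 / (1 - r2)
})
round(vifs, 1)   # x1 and x2 are large (the collinear pair), x3 stays near 1
```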

The model suffers from severe multicollinearity, and the diagnostic plots also show clear outliers that need to be removed. Since multicollinearity does not affect outlier diagnostics, we deal with the outliers first: we drop the clearly flagged points 39, 207, and 224, then check for high-leverage points:

hat.plot(model)

## integer(0)

They are 31, 36, 39, 41, 42, 86, 159, 175, and 206. We drop all points flagged so far and refit the model:

data2 <- data[-c(182, 31, 36, 39, 41, 42, 86, 159, 175, 206, 207, 297, 224),]
model2 <- lm(BODYFAT ~ ., data = data2)
plot(model2)

Check for high-leverage points:

hat.plot(model2)

## integer(0)

They are 49, 155, 205, 210, 216, and 221, plus 54 and 163 (54 and 163 are the unlabelled points clearly visible on the right of the residuals-vs-leverage plot).

We drop all flagged points and refit the model once more:

data3 <- data[-c(182, 31, 36, 39, 41, 42, 86, 106, 159, 175, 206, 207, 297, 224, 49, 155, 205, 210, 216, 230, 216, 221, 54, 163),]
model3 <- lm(BODYFAT ~ ., data = data3)
plot(model3)

The model now shows no obvious high-leverage points or outliers, but the normal QQ plot suggests the normality assumption may not hold; we test this:

shapiro.test(model3$residuals) 
## 
##  Shapiro-Wilk normality test
## 
## data:  model3$residuals
## W = 0.99011, p-value = 0.118

With p = 0.118, the Shapiro-Wilk test does not reject normality at the 5% level, though the p-value is not large, so the fit to normality is only moderate. We first try a Box-Cox transformation:

data.trans <- data[-c(182, 31, 36, 39, 41, 42, 86, 106, 159, 175, 206, 207, 297, 224, 49, 155, 205, 210, 216, 230, 216, 221, 54, 163),]
summary(p1 <- powerTransform(data.trans$BODYFAT))
## bcPower Transformation to Normality 
##                    Est Power Rounded Pwr Wald Lwr bnd Wald Upr Bnd
## data.trans$BODYFAT    0.9068           1       0.6327       1.1809
## 
## Likelihood ratio tests about transformation parameters
##                              LRT df         pval
## LR test, lambda = (0) 51.0480684  1 9.012791e-13
## LR test, lambda = (1)  0.4368302  1 5.086564e-01

The estimated power is 0.91 (the rounded power is 1, and the Wald interval contains 1); we take the power to be 0.9:

model.trans <- lm(BODYFAT^0.9 ~ ., data = data3)
plot(model.trans)
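For reference, the Box-Cox family that powerTransform estimates is y(lambda) = (y^lambda - 1) / lambda for lambda != 0, with log(y) as the limit at 0; a minimal base-R sketch:

```r
# Box-Cox power transform: (y^lambda - 1) / lambda, with log(y) as the limit at 0
bc <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}
bc(10, 1)    # 9: lambda = 1 only shifts, i.e. effectively no transformation
bc(10, 0.9)  # the power chosen above
```

Note that the model above uses the plain power BODYFAT^0.9 rather than the full bcPower formula; the two differ only by an affine rescaling, so the fitted models are equivalent.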

After the Box-Cox transformation the residuals still deviate somewhat from normality. However, as we know from linear regression theory, when the independence, linearity, and homoscedasticity assumptions hold and the error distribution is reasonably close to normal, the linear model remains robust. We now verify each of these assumptions, using the methods described in R in Action:

Independence assumption:

durbinWatsonTest(model3)
##  lag Autocorrelation D-W Statistic p-value
##    1     -0.01098102      2.013511    0.97
##  Alternative hypothesis: rho != 0

The Durbin-Watson test's p-value is very high, so the independence assumption is well satisfied.
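The Durbin-Watson statistic itself is just the sum of squared successive residual differences over the residual sum of squares, and is near 2 when successive residuals are uncorrelated. A base-R sketch on synthetic residuals:

```r
# Durbin-Watson statistic by hand; independent residuals give a value near 2
set.seed(7)
e <- rnorm(100)                  # stand-in for model residuals
dw <- sum(diff(e)^2) / sum(e^2)
round(dw, 2)                     # close to 2, consistent with independence
```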

Homoscedasticity assumption:

ncvTest(model3)
## Non-constant Variance Score Test 
## Variance formula: ~ fitted.values 
## Chisquare = 0.02290288    Df = 1     p = 0.8797099

The score test's p-value is very high, so homoscedasticity is satisfied.

Linearity, and whether the error skewness and kurtosis are close to those of a normal distribution (the function used here is described on page 181 of the Chinese edition of R in Action):

require(gvlma)
## Loading required package: gvlma
gvmodel <- gvlma(model3)
summary(gvmodel)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data3)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -8.7371 -2.7499 -0.1166  2.7631  8.5008 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -68.65657   57.64641  -1.191   0.2350    
## AGE           0.06772    0.03128   2.165   0.0315 *  
## WEIGHT       -0.21067    0.16190  -1.301   0.1946    
## HEIGHT        0.87035    0.80946   1.075   0.2835    
## ADIPOSITY     1.39301    1.15450   1.207   0.2289    
## NECK         -0.47384    0.24490  -1.935   0.0543 .  
## CHEST        -0.12804    0.10877  -1.177   0.2404    
## ABDOMEN       0.84775    0.09059   9.358   <2e-16 ***
## HIP          -0.26373    0.15619  -1.689   0.0928 .  
## THIGH         0.23821    0.14259   1.671   0.0963 .  
## KNEE         -0.05898    0.25266  -0.233   0.8157    
## ANKLE         0.12921    0.34548   0.374   0.7088    
## BICEPS        0.25931    0.18956   1.368   0.1728    
## FOREARM       0.12608    0.33270   0.379   0.7051    
## WRIST        -1.22292    0.58205  -2.101   0.0368 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.873 on 215 degrees of freedom
## Multiple R-squared:  0.7326, Adjusted R-squared:  0.7152 
## F-statistic: 42.08 on 14 and 215 DF,  p-value: < 2.2e-16
## 
## 
## ASSESSMENT OF THE LINEAR MODEL ASSUMPTIONS
## USING THE GLOBAL TEST ON 4 DEGREES-OF-FREEDOM:
## Level of Significance =  0.05 
## 
## Call:
##  gvlma(x = model3) 
## 
##                       Value p-value                Decision
## Global Stat        7.488862 0.11220 Assumptions acceptable.
## Skewness           0.003326 0.95401 Assumptions acceptable.
## Kurtosis           3.699545 0.05443 Assumptions acceptable.
## Link Function      3.728642 0.05349 Assumptions acceptable.
## Heteroscedasticity 0.057350 0.81073 Assumptions acceptable.

The linearity assumption is satisfied, and the residual skewness and kurtosis are close to those of a normal distribution. Combined with the earlier Shapiro-Wilk p-value of 0.118, not large but not significant either, this suggests the linear model is robust enough, so we proceed with it.

Next we address the multicollinearity. We consider three approaches: (1) lasso regression, (2) principal component regression, and (3) partial least squares. We will then use k-fold cross-validation to pick the best of the resulting models; the literature suggests k = 5 is acceptable.

First the lasso, using the glmnet package (written by the inventors of the lasso); an introduction (in Chinese) is available at https://cosx.org/2016/10/data-mining-1-lasso/

require(glmnet)
## Loading required package: glmnet
## Loading required package: Matrix
## Loading required package: foreach
## Loaded glmnet 2.0-13
y = as.matrix(data3[,1])
x = as.matrix(data3[,-1])
fit = cv.glmnet(x = x, y = y, family = "gaussian", nlambda = 100, alpha = 1, standardize=TRUE, nfolds = 10)
print(fit$lambda.1se)
## [1] 0.4771904
plot(fit)
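Internally, glmnet's coordinate descent is built on a soft-thresholding operator that zeroes out small coefficients, which is exactly why the lasso performs variable selection. A minimal sketch of the operator:

```r
# Soft-thresholding: shrink each entry toward zero by gamma, zeroing the small ones
soft <- function(z, gamma) sign(z) * pmax(abs(z) - gamma, 0)
soft(c(-3, -0.2, 0.5, 2), 1)   # -2 0 0 1: small entries dropped, large ones shrunk
```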

cv.glmnet performs cross-validation internally to choose the best lambda. Here the lasso keeps four variables, and we can set lambda to the printed lambda.1se value of roughly 0.48. Next we determine which variables are kept:

fit2 = glmnet(x = x, y = y, family = "gaussian", nlambda = 100, alpha = 1, standardize=TRUE)
plot(fit2, xvar = "lambda", label = TRUE)

Predictors 1, 3, 7, and 14 — AGE, HEIGHT, ABDOMEN, and WRIST — are retained. We store the corresponding columns in data.lasso and fit the model:

data.lasso <- data3[,c(1,2,4,8,15)]
model.lasso <- lm(BODYFAT ~., data = data.lasso)
summary(model.lasso)
## 
## Call:
## lm(formula = BODYFAT ~ ., data = data.lasso)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.0463 -2.9007 -0.2994  3.0831  8.2874 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.94653    7.91244   0.120   0.9049    
## AGE          0.05118    0.02290   2.235   0.0264 *  
## HEIGHT      -0.27036    0.11968  -2.259   0.0248 *  
## ABDOMEN      0.70219    0.03462  20.282  < 2e-16 ***
## WRIST       -1.65347    0.41478  -3.986 9.07e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.927 on 225 degrees of freedom
## Multiple R-squared:  0.7124, Adjusted R-squared:  0.7072 
## F-statistic: 139.3 on 4 and 225 DF,  p-value: < 2.2e-16
plot(model.lasso)

vif(model.lasso)
##      AGE   HEIGHT  ABDOMEN    WRIST 
## 1.292792 1.421544 1.666044 1.947444

Apart from a slight remaining deviation from normality, all the model assumptions are now well satisfied, every variable is significant, and the variance inflation factors are small: the multicollinearity problem is solved.

We standardize the data in preparation for the principal component and partial least squares models:

data.scaled <- as.data.frame(scale(data3, center = TRUE, scale = TRUE))
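scale() centers each column to mean 0 and rescales it to standard deviation 1; a quick check on toy data:

```r
# Each column of scale()'s output has mean 0 and sd 1
m <- scale(matrix(c(1, 2, 3, 10, 20, 30), ncol = 2))
round(colMeans(m), 10)   # 0 0
apply(m, 2, sd)          # 1 1
```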

Next the principal component model. The functions used are described in chapter 14 of R in Action; the psych package is required:

require(psych)
## Loading required package: psych
## 
## Attaching package: 'psych'
## The following object is masked from 'package:car':
## 
##     logit

The fa.parallel function helps determine the number of principal components:

fa.parallel(data.scaled[,-1], fa = "pc", n.iter = 1000, show.legend = FALSE, main = "Scree plot with parallel analysis")
## The estimated weights for the factor scores are probably incorrect.  Try a different factor extraction method.

## Parallel analysis suggests that the number of factors =  NA  and the number of components =  2

The scree plot reflects the three rules for choosing the number of components described in R in Action. The second and third criteria both recommend 2 components, while the first could support 2, 3, 4, or 5. Below we extract each of these numbers of components and inspect the results:

principal(data.scaled[,-1], nfactors = 2)
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 2)
## Standardized loadings (pattern matrix) based upon correlation matrix
##            RC1   RC2   h2    u2 com
## AGE       0.23 -0.84 0.76 0.242 1.2
## WEIGHT    0.94  0.29 0.96 0.036 1.2
## HEIGHT    0.33  0.67 0.57 0.433 1.5
## ADIPOSITY 0.94 -0.06 0.88 0.115 1.0
## NECK      0.86  0.05 0.75 0.254 1.0
## CHEST     0.93 -0.05 0.87 0.133 1.0
## ABDOMEN   0.92 -0.10 0.86 0.139 1.0
## HIP       0.88  0.31 0.87 0.133 1.2
## THIGH     0.79  0.41 0.79 0.211 1.5
## KNEE      0.83  0.30 0.79 0.213 1.3
## ANKLE     0.67  0.48 0.67 0.331 1.8
## BICEPS    0.82  0.22 0.71 0.285 1.1
## FOREARM   0.81  0.30 0.74 0.256 1.3
## WRIST     0.80  0.03 0.64 0.361 1.0
## 
##                        RC1  RC2
## SS loadings           8.88 1.98
## Proportion Var        0.63 0.14
## Cumulative Var        0.63 0.78
## Proportion Explained  0.82 0.18
## Cumulative Proportion 0.82 1.00
## 
## Mean item complexity =  1.2
## Test of the hypothesis that 2 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.07 
##  with the empirical chi square  191.73  with prob <  1.1e-14 
## 
## Fit based upon off diagonal values = 0.99
principal(data.scaled[,-1], nfactors = 3)
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 3)
## Standardized loadings (pattern matrix) based upon correlation matrix
##            RC1   RC3   RC2   h2    u2 com
## AGE       0.03 -0.14  0.95 0.92 0.081 1.0
## WEIGHT    0.86  0.48  0.03 0.96 0.035 1.6
## HEIGHT    0.11  0.92 -0.15 0.87 0.127 1.1
## ADIPOSITY 0.98  0.02  0.13 0.97 0.026 1.0
## NECK      0.75  0.38  0.25 0.76 0.239 1.7
## CHEST     0.88  0.20  0.25 0.87 0.129 1.3
## ABDOMEN   0.88  0.14  0.27 0.87 0.129 1.2
## HIP       0.86  0.37 -0.07 0.88 0.121 1.4
## THIGH     0.86  0.26 -0.28 0.88 0.120 1.4
## KNEE      0.73  0.51  0.03 0.80 0.203 1.8
## ANKLE     0.56  0.60 -0.12 0.69 0.307 2.1
## BICEPS    0.83  0.24 -0.06 0.75 0.249 1.2
## FOREARM   0.75  0.42 -0.03 0.74 0.256 1.6
## WRIST     0.57  0.56  0.41 0.81 0.194 2.8
## 
##                        RC1  RC3  RC2
## SS loadings           7.72 2.65 1.41
## Proportion Var        0.55 0.19 0.10
## Cumulative Var        0.55 0.74 0.84
## Proportion Explained  0.66 0.22 0.12
## Cumulative Proportion 0.66 0.88 1.00
## 
## Mean item complexity =  1.5
## Test of the hypothesis that 3 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.05 
##  with the empirical chi square  89.14  with prob <  0.001 
## 
## Fit based upon off diagonal values = 0.99
principal(data.scaled[,-1], nfactors = 4)
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 4)
## Standardized loadings (pattern matrix) based upon correlation matrix
##            RC1  RC4   RC3   RC2   h2    u2 com
## AGE       0.05 0.03 -0.14  0.95 0.93 0.066 1.1
## WEIGHT    0.76 0.47  0.42  0.01 0.97 0.026 2.3
## HEIGHT    0.11 0.15  0.92 -0.13 0.90 0.102 1.1
## ADIPOSITY 0.86 0.47 -0.04  0.09 0.97 0.025 1.6
## NECK      0.48 0.72  0.23  0.17 0.84 0.164 2.1
## CHEST     0.78 0.45  0.14  0.22 0.88 0.124 1.9
## ABDOMEN   0.87 0.30  0.13  0.27 0.94 0.065 1.5
## HIP       0.84 0.31  0.36 -0.07 0.93 0.069 1.7
## THIGH     0.79 0.35  0.23 -0.30 0.89 0.108 1.9
## KNEE      0.72 0.30  0.50  0.04 0.85 0.148 2.2
## ANKLE     0.48 0.37  0.56 -0.14 0.70 0.304 2.9
## BICEPS    0.57 0.70  0.09 -0.15 0.84 0.156 2.1
## FOREARM   0.43 0.81  0.24 -0.13 0.92 0.083 1.8
## WRIST     0.32 0.70  0.41  0.33 0.87 0.131 2.6
## 
##                        RC1  RC4  RC3  RC2
## SS loadings           5.60 3.36 2.12 1.35
## Proportion Var        0.40 0.24 0.15 0.10
## Cumulative Var        0.40 0.64 0.79 0.89
## Proportion Explained  0.45 0.27 0.17 0.11
## Cumulative Proportion 0.45 0.72 0.89 1.00
## 
## Mean item complexity =  1.9
## Test of the hypothesis that 4 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.03 
##  with the empirical chi square  39.95  with prob <  0.52 
## 
## Fit based upon off diagonal values = 1
principal(data.scaled[,-1], nfactors = 5)
## Principal Components Analysis
## Call: principal(r = data.scaled[, -1], nfactors = 5)
## Standardized loadings (pattern matrix) based upon correlation matrix
##            RC1  RC4   RC5   RC3   RC2   h2     u2 com
## AGE       0.06 0.02 -0.05 -0.12  0.96 0.93 0.0658 1.0
## WEIGHT    0.76 0.45  0.28  0.35  0.01 0.98 0.0170 2.4
## HEIGHT    0.14 0.17  0.19  0.94 -0.14 0.99 0.0076 1.2
## ADIPOSITY 0.84 0.44  0.23 -0.15  0.09 0.98 0.0199 1.8
## NECK      0.49 0.71  0.20  0.18  0.18 0.84 0.1565 2.2
## CHEST     0.78 0.44  0.17  0.09  0.22 0.88 0.1181 1.9
## ABDOMEN   0.87 0.29  0.14  0.09  0.27 0.95 0.0549 1.5
## HIP       0.83 0.30  0.25  0.29 -0.08 0.94 0.0632 1.7
## THIGH     0.78 0.33  0.27  0.13 -0.30 0.89 0.1077 2.0
## KNEE      0.68 0.26  0.47  0.33  0.03 0.87 0.1320 2.6
## ANKLE     0.39 0.27  0.80  0.22 -0.15 0.94 0.0573 2.0
## BICEPS    0.58 0.71  0.07  0.08 -0.14 0.87 0.1298 2.1
## FOREARM   0.42 0.80  0.24  0.17 -0.13 0.92 0.0813 1.9
## WRIST     0.29 0.64  0.49  0.23  0.33 0.90 0.0987 3.3
## 
##                        RC1  RC4  RC5  RC3  RC2
## SS loadings           5.43 3.10 1.56 1.44 1.36
## Proportion Var        0.39 0.22 0.11 0.10 0.10
## Cumulative Var        0.39 0.61 0.72 0.82 0.92
## Proportion Explained  0.42 0.24 0.12 0.11 0.11
## Cumulative Proportion 0.42 0.66 0.78 0.89 1.00
## 
## Mean item complexity =  2
## Test of the hypothesis that 5 components are sufficient.
## 
## The root mean square of the residuals (RMSR) is  0.02 
##  with the empirical chi square  20.06  with prob <  0.93 
## 
## Fit based upon off diagonal values = 1

The u2 column gives the proportion of each variable's variance that the components fail to capture. Going from two to three components sharply reduces the unexplained proportion for HEIGHT, and going from four to five does the same for ANKLE. To settle on the best number of components, we will use k-fold cross-validation below to pick the number best suited to prediction.
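For reference, psych's h2 and u2 columns can be reproduced from the loadings: h2 is the row sum of squared loadings, and u2 = 1 - h2. A sketch using two rows of loadings from the 2-component fit above:

```r
# Communality h2 = row sums of squared loadings; uniqueness u2 = 1 - h2
L <- rbind(AGE = c(0.23, -0.84), WEIGHT = c(0.94, 0.29))  # from the 2-PC output
h2 <- rowSums(L^2)
round(1 - h2, 2)   # AGE 0.24, WEIGHT 0.03 -- matching the u2 column up to rounding
```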

Next, the partial least squares model; the pls package is required:

require("pls")
## Loading required package: pls
## 
## Attaching package: 'pls'
## The following object is masked from 'package:stats':
## 
##     loadings
fit.pls <- plsr(BODYFAT~.,data=data.scaled,scale=T,validation="CV")
summary(fit.pls)
## Data:    X dimension: 230 14 
##  Y dimension: 230 1
## Fit method: kernelpls
## Number of components considered: 14
## 
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           1.002   0.7509   0.6177   0.5880   0.5711   0.5549   0.5485
## adjCV        1.002   0.7507   0.6171   0.5872   0.5700   0.5531   0.5469
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV      0.5452   0.5446   0.5451    0.5455    0.5451    0.5454    0.5446
## adjCV   0.5437   0.5432   0.5436    0.5440    0.5437    0.5439    0.5431
##        14 comps
## CV       0.5449
## adjCV    0.5434
## 
## TRAINING: % variance explained
##          1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X          65.68    76.50    82.52    86.59    89.45    92.06    93.90
## BODYFAT    44.47    63.39    67.43    69.98    72.38    72.88    73.06
##          8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## X          95.06    96.24     97.35     98.16     98.93     99.47
## BODYFAT    73.08    73.10     73.12     73.15     73.19     73.25
##          14 comps
## X          100.00
## BODYFAT     73.26

Since the adjCV values change from run to run, after several repetitions we find that 2 to 5 components are reasonable, with 2 or 3 components already explaining the response variance well. We now use k-fold cross-validation to compare the lasso model, the principal component models with 2 to 5 components, and the partial least squares models with 2 to 5 components. To keep the CV results on a common scale, all analyses below use the standardized data:

The boot package is required:

require(boot)
## Loading required package: boot
## 
## Attaching package: 'boot'
## The following object is masked from 'package:psych':
## 
##     logit
## The following object is masked from 'package:car':
## 
##     logit

Run 10-fold cross-validation on the lasso model 30 times and average the adjusted CV error:

data.lasso.scaled <- as.data.frame(scale(data.lasso, center = TRUE, scale = TRUE))
glm.fit <- glm(BODYFAT ~ ., data = data.lasso.scaled)
cv.err <- rep(0, 30)
for (i in 1:30) {
  cv.err[i] <- cv.glm(data.lasso.scaled, glm.fit, K = 10)$delta[2]
}
plot(cv.err)

print(mean(cv.err))
## [1] 0.2992133
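The quantity cv.glm estimates — the average out-of-fold squared prediction error — can be reproduced by hand. A base-R sketch on synthetic data (the model and data here are illustrative):

```r
# Manual 10-fold cross-validation for a linear model on synthetic data
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 1 + 2 * x + rnorm(n)           # true noise variance is 1
d <- data.frame(x, y)
folds <- sample(rep(1:10, length.out = n))
mse <- sapply(1:10, function(k) {
  fit <- lm(y ~ x, data = d[folds != k, ])           # train on 9 folds
  mean((d$y[folds == k] - predict(fit, d[folds == k, ]))^2)  # test on the held-out fold
})
mean(mse)   # CV estimate of prediction error, close to the noise variance
```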

A helper that cross-validates the principal component model; n is the number of components:

get.cv.prin <- function(n) {
  covmatrix <- cov(data.scaled[, -1])
  eigen.vector <- eigen(covmatrix)$vectors
  eigen.vector.n <- eigen.vector[, 1:n]
  prin.data.n <- as.matrix(data.scaled[, -1]) %*% eigen.vector.n
  data.frame.n <- as.data.frame(cbind(data.scaled[, 1], prin.data.n))
  cv.err <- rep(0, 30)
  for (i in 1:30) {
    glm.fit <- glm(V1 ~ ., data = data.frame.n)  # V1 is BODYFAT; regress on all n components
    cv.err[i] <- cv.glm(data.frame.n, glm.fit, K = 10)$delta[2]
  }
  plot(cv.err)
  print(mean(cv.err))
}
get.cv.prin(2)

## [1] 0.4840962
get.cv.prin(3)

## [1] 0.4852676
get.cv.prin(4)

## [1] 0.4841474
get.cv.prin(5)

## [1] 0.4835962

The CV for PLS was already given above. The final comparison surprisingly suggests that lasso beats principal components, which beats partial least squares. I suspect the cross-validation comparison is flawed — in particular, the principal component CV errors above barely change with the number of components, which deserves a closer look — so treat this as a first draft.
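One concrete caveat before trusting that ranking: cv.glm's delta is a mean squared error, while plsr's adjCV is a root mean squared error, so the PLS value should be squared before it is compared with the others. Using the numbers printed above:

```r
# Put the PLS CV error on the same (squared-error) scale as cv.glm's delta
pls_adjCV <- 0.5434   # adjCV at 14 components, from the plsr output above
round(pls_adjCV^2, 3) # ~0.295, on par with the lasso CV error of 0.299
```

On this common scale, PLS and the lasso are essentially tied rather than PLS being worst.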